K-means#
Introduction#
Perhaps the most popular clustering method is K-means, which is often referred to as Lloyd's algorithm.
The uses of K-means
The uses of K-means include:
Search engines: grouping similar search results together using a clustering algorithm
Customer segmentation: building customer clusters based on demographic, geographic and behavioral data
Social network analysis: finding groups of people with specific interests to target personalized ads
Data centers: organizing computer clusters in a data center
Inventory management: creating inventory clusters based on sales numbers and manufacturing capacity
The main idea of the method is the iterative repetition of two steps:
distribution of sample objects into clusters;
recalculation of cluster centers.
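The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the full sklearn implementation (for instance, it does not guard against clusters becoming empty), and the function name is ours:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k sample points as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each object to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recalculate each center as the mean of its cluster
        # (a production version would also handle empty clusters here)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points, the loop typically converges in a couple of iterations, returning one label per object and one center per cluster.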

How K-Means works#
Let’s begin by randomly picking some starting points in our world of features. We call these points ‘Cluster Centers,’ and we’ll denote them as \(\mu_1, \dots, \mu_K\).
Each of our data points, labeled j (where j goes from 1 to m), finds its home in the cluster whose center is the closest. Think of it like finding the cozy spot nearest to you!
Here’s how we decide that: we use the rule
\[c_j = \arg\min_{k} \|x_j - \mu_k\|^2,\]
which is like saying, ‘Which cluster center \(\mu_k\) is the closest to our data point \(x_j\)?’
The interesting part is that we’re on a mission to find the best spots for our Cluster Centers. We keep shuffling our data points between the clusters and adjusting the Cluster Centers until things settle down. It’s like arranging furniture in a room until everyone’s comfy. This shuffle-and-adjust dance is our way of making sure our clusters are just right.
Now, imagine two main moves in our dance. First, the ‘Assignment Step’: we assign each data point to its closest Cluster Center. Picture it as deciding where each piece of furniture should go based on proximity. Then comes the ‘Move Centroid Step’: we recalculate the Cluster Centers by finding the average position of all the data points in each cluster. It’s like finding the center of mass for each cluster.
Next, we need to figure out how many clusters we want. Let’s call this number ‘K.’ We randomly place K points in our world and call them ‘Cluster Centroids.’ Initially, these centroids are like tourists wandering around randomly. But as we keep reassigning and adjusting, these centroids find their true homes at the heart of their clusters.
We keep going through the shuffle-and-adjust routine until our Cluster Centers decide, ‘This is it, we’re not moving anymore!’ Once they’re content, we’ve successfully ‘converged’ our K-Means algorithm. The data points now belong to their respective clusters.
The update rule
\[\mu_k = \frac{1}{|C_k|} \sum_{x_j \in C_k} x_j\]
is our way of fine-tuning the Cluster Centers during each adjustment. It’s like finding the sweet spot that minimizes the total squared distance between the data points of a cluster and its center.
Fig. 1 K-means scheme#
Selecting an Initial Approximation#
The first question when choosing the initial positions of the centers is how, when drawing centers from some random distribution, to avoid ending up in a region of the feature space that contains no sample points. The basic solution is simply to select some of the sample objects as centers.
The second potential problem is a crowded placement of the centers. In this case their initial positions will most likely be far from the final positions of the cluster centers, and for such initial positions we will get incorrect clustering.
To combat this phenomenon, it helps to take centers that are as far apart from each other as possible.
In practice, the following methods work:
The first center is selected uniformly at random from the sample points;
Each subsequent center is selected from a random distribution over the sample objects in which the probability of choosing an object is proportional to the squared distance from it to the nearest already-chosen center.
Note
A modification of K-means that uses this heuristic to select initial guesses is called K-means++.
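A minimal NumPy sketch of this seeding heuristic (the function name is ours, for illustration; sklearn provides it as `init='k-means++'` in `KMeans`):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: uniform over the sample points
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Because far-away points get high sampling probability, the initial centers tend to spread out across the data, which is exactly the behavior motivated above.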
What K-means optimizes#
Let’s talk on an intuitive level about what optimization problem K-means solves.
Both steps of the algorithm work to reduce the average squared Euclidean distance from objects to the centers of their clusters:
\[\Phi_0 = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \to \min.\]
At the step of assigning objects to one of the clusters, we select the cluster with the nearest centroid, that is, we minimize each term in \(\Phi_0\): we set all potentially large terms to zero, and leave only the smallest possible ones non-zero (provided the cluster centers are fixed).
At the next step, the step of recalculating cluster centers, we choose each center so that, for a fixed assignment of objects to clusters, the inner sum is minimized for every \(k\):
\[\sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \to \min_{\mu_k}.\]
Here it becomes fundamental that we define the squared distance as the squared norm of the difference of vectors: differentiating with respect to \(\mu_k\) and writing down the necessary condition for an extremum shows that the cluster centers must be recalculated as the arithmetic means of the \(x_i\) belonging to the cluster.
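Writing out that necessary condition makes the conclusion explicit:

```latex
\frac{\partial}{\partial \mu_k} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2
  = -2 \sum_{x_i \in C_k} (x_i - \mu_k) = 0
\quad\Longrightarrow\quad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i .
```

For any other metric the minimizer of the inner sum is generally not the arithmetic mean (for the Manhattan distance, for example, it is the coordinate-wise median), which is why plain K-means is tied to the Euclidean metric.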
EXAMPLE: Mall Customer Segmentation#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.read_csv('https://raw.githubusercontent.com/satishgunjal/datasets/master/Mall_Customers.csv')
Understanding The Data#
There are 200 unlabelled training examples in total, and we’ll utilize annual income and spending score to identify clusters within the data.
It’s important to note that the spending score, ranging from 1 to 100, is assigned by the mall based on customer behavior and spending habits.
plt.figure(figsize=(10,6))
plt.scatter(df['Annual Income (k$)'],df['Spending Score (1-100)'])
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Unlabelled Mall Customer Data')
Random Initialization Trap#
Since we pick the initial cluster centroids randomly, the initialization may affect the final outcome of the clustering. If the initialization is poor, the K-Means algorithm may form a cluster containing only a few points. Such a situation is referred to as the centroid random initialization trap, and it may cause the algorithm to get stuck in a local optimum.
Tip
To avoid the random initialization trap, follow these guidelines for random initialization:
The number of cluster centroids should be less than the number of training examples
To avoid the local-optima issue, try multiple random initializations of the centroids
Multiple random initialization is most effective when we have a small number of clusters
For a large number of clusters, a few random initializations are sufficient
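A small sketch of the effect using sklearn's `KMeans` on synthetic blobs (the dataset and parameters here are illustrative). The `n_init` parameter is exactly the "multiple random initialization" knob: sklearn runs the algorithm `n_init` times and keeps the run with the lowest inertia:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 well-separated blobs
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.5, random_state=42)

# A single random initialization may get stuck in a poor local optimum...
single = KMeans(n_clusters=5, init='random', n_init=1, random_state=0).fit(X)
# ...while the best of 10 restarts is at least as good
multi = KMeans(n_clusters=5, init='random', n_init=10, random_state=0).fit(X)

print(single.inertia_, multi.inertia_)
```

The multi-restart inertia is never worse than the single-restart one, since the best of several runs is kept.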
Choosing The Number of Clusters#
Multiple random initializations let us avoid the local-optima issue, but to choose how many clusters to look for in the data we can use the methods below.
Elbow Method#
In the Elbow method we run the K-Means algorithm multiple times in a loop with an increasing number of clusters (say from 1 to 10), and then plot the clustering score as a function of the number of clusters. The clustering score is simply the sum of squared distances of samples to their closest cluster center. The elbow is the point on the plot where the decrease of the clustering score (distortion) slows down, and the number of clusters at that point is the optimal one. Sometimes there is no clear elbow point on the plot, and in such cases it is very hard to settle on the number of clusters.
Advantages and Disadvantages of K-Means
Advantages
One of the simplest algorithms to understand
Relatively efficient, since it uses only simple computations
Gives better results when there is little overlap in the data
Disadvantages
The number of clusters needs to be defined by the user
Doesn't work well with overlapping data
Unable to handle noisy data and outliers
Fails on non-linear data sets
X = df.iloc[:, [3,4]].values
clustering_score = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='random', n_init=10, random_state=42)
    kmeans.fit(X)
    clustering_score.append(kmeans.inertia_)  # inertia_ = sum of squared distances of samples to their closest cluster center
plt.figure(figsize=(10,6))
plt.plot(range(1, 11), clustering_score)
plt.scatter(5, clustering_score[4], s=200, c='red', marker='*')
plt.title('The Elbow Method')
plt.xlabel('No. of Clusters')
plt.ylabel('Clustering Score')
plt.show()
Warning
To create the plot above we used inertia_ (the sum of squared distances of samples to their closest cluster center). On the elbow plot, choose the cluster count after which the WCSS (Y axis) stops decreasing rapidly.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='random', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method on Iris Dataset')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()
Visualization Of KMeans#
X = df.iloc[:, [3,4]].values
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X)
pred = kmeans.predict(X)
df['Cluster'] = pred
plt.figure(figsize=(10,6))
plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:, 1],s = 300, c = 'red', label = 'Centroid', marker='*')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.title('Customer Clusters')
Inner Working#
Note
By using the elbow method we found the optimal number of clusters. The code below creates a visualization where you can move a slider to select the number of clusters. The K-means algorithm is executed with the selected number of clusters, and the resulting clusters are visualized along with their centroids. This helps you observe how different cluster counts affect the grouping of the data points.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from ipywidgets import interact, IntSlider
# Load the mall customer data
df = pd.read_csv('https://raw.githubusercontent.com/satishgunjal/datasets/master/Mall_Customers.csv')
X = df.iloc[:, [3,4]].values
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(X)
predict = kmeans.fit_predict(X)
def kmeans_interactive_X(num_clusters):
    # Fit KMeans model
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    return X[:, 0]

def kmeans_interactive_Y(num_clusters):
    # Fit KMeans model
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    return X[:, 1]

def cluster_centers_X(num_clusters):
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    return kmeans.cluster_centers_[:, 0]

def cluster_centers_Y(num_clusters):
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    return kmeans.cluster_centers_[:, 1]

def predict(num_clusters):
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    return kmeans.fit_predict(X)
fig = go.Figure()
fig.add_trace(go.Scatter(x=X[:, 0], y=X[:, 1], mode='markers', marker=dict(color=kmeans.labels_, size=10, opacity=0.7), name = 'predicted label'))
fig.add_trace(go.Scatter(x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:, 1], mode='markers', marker=dict(color='red', size=12, symbol='star'), name='Centroids'))
fig.update_layout(title=dict(text='KMeans Clustering with different clusters', x=0.5, xanchor='center'), xaxis_title='Annual Income', yaxis_title='Spending Score', width=800, # Set the width of the figure
height=600)
fig.update_layout(sliders=[
    {
        'steps': [
            {
                'method': 'animate',
                'label': str(components),
                'args': [
                    [str(components)],
                    {
                        'mode': 'immediate',
                        'frame': {'duration': 500, 'redraw': True},
                        'transition': {'duration': 300}
                    }
                ]
            }
            for components in range(2, 7)
        ],
        'currentvalue': {'prefix': 'Number of Clusters: '}
    }
])
frames = [go.Frame(data=[
    go.Scatter(
        x=kmeans_interactive_X(components),
        y=kmeans_interactive_Y(components),
        mode='markers',
        marker=dict(color=predict(components), size=10, opacity=0.7),
    ),
    go.Scatter(
        x=cluster_centers_X(components),
        y=cluster_centers_Y(components),
        name='Centroids',
        mode='markers',
        marker=dict(color='red', size=12, symbol='star'),
    )
], name=str(components)) for components in range(2, 7)]
fig.frames = frames
fig.show()
EXAMPLE: University Income depending on Student count#
plt.figure(figsize=(7,5))
plt.scatter(pd.to_numeric(df['num_students']),pd.to_numeric(df['income']), s=10, c='skyblue', edgecolors='purple')
plt.xlabel('Student Number')
plt.ylabel('Income')
plt.xlim(1000, 60000)
plt.title('University income depending on student number')
plt.show()
Before clustering, we prepare our dataset: we clean the data by removing rows with missing values, ensuring complete information, and then select the columns relevant to our task.
def clean_dataset(df):
    df.dropna(inplace=True)
    return df
df = clean_dataset(df)
X = df.values[:, [7,9]]
X = np.nan_to_num(X)
X
array([['83.7', 2243.0],
['87.5', 11074.0],
['64.3', 15596.0],
...,
['44.0', 31268.0],
['40.4', 10117.0],
['39.8', 8663.0]], dtype=object)
After convergence, K-means assesses cluster quality via the sum of squared errors (SSE). The code below lets you step through iterations manually and adjust the cluster count. With 'random' initialization, the centroids may not represent the data well at first; increasing the number of iterations shows how the centroid positions and the cluster assignments change.
class Graph1:
    def __init__(self, X):
        self.X = X
        self.iter_num = 1
        self.cluster_num = 2
        self.colors = ['blue', 'darkorange', 'yellow', 'green', 'violet', 'brown']
        self.fig, self.ax = plt.subplots()
        self.create_default_graph()

    def increase_iteration_number(self, _):
        self.iter_num += 1
        self.update_cluster()

    def increase_cluster_num(self, change):
        if str(change.new).isnumeric():
            self.cluster_num = int(change.new)
            self.update_cluster()

    def create_default_graph(self):
        plt.ion()
        self.ax.set_xlabel('Income')
        self.ax.set_ylabel('Student number')
        self.ax.set_ylim(100, 50000)
        self.ax.set_xlim(0, 100)
        self.ax.set_xticks(np.arange(0, 100, step=10))
        self.ax.set_title('Income depending on student number')
        self.fig.canvas.draw_idle()

    def update_cluster(self):
        kmeans = KMeans(init='random', n_clusters=self.cluster_num, n_init=10, max_iter=self.iter_num)
        kmeans.fit(self.X)
        pred = kmeans.predict(self.X)
        self.ax.clear()
        self.create_default_graph()
        for i in range(self.cluster_num):
            self.ax.scatter(self.X[pred == i, 0], self.X[pred == i, 1], s=20, c=self.colors[i], label='Cluster '+str(i), edgecolors='k')
        self.ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='red', label='Centroids', marker='*')
        self.fig.legend()
Metric selection#
Since K-means works by sequentially repeating two steps until convergence, the validity of using different metrics or proximity functions depends on whether they “break” any of these steps or not.
The first step, assigning objects to the nearest centers, does not depend on the type of metric. The second step recalculates the centers as the arithmetic mean of the points in the cluster, and here lies the catch: it is precisely the Euclidean metric for which the arithmetic mean is the optimal choice of center.
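A quick numerical check of that fact (a sketch on random points; the point cloud is illustrative): for any set of points, the arithmetic mean minimizes the sum of squared Euclidean distances, so perturbing the mean in any direction can only increase it.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))  # an arbitrary cloud of 2-D points

def sum_sq_dist(center):
    # Sum of squared Euclidean distances from all points to a candidate center
    return np.sum(np.linalg.norm(pts - center, axis=1) ** 2)

mean = pts.mean(axis=0)
# The mean is the exact minimizer: any shift increases the objective
for shift in np.eye(2):
    assert sum_sq_dist(mean) <= sum_sq_dist(mean + 0.1 * shift)
    assert sum_sq_dist(mean) <= sum_sq_dist(mean - 0.1 * shift)
```

With another proximity function the minimizer changes (for the Manhattan distance it is the coordinate-wise median), which is why swapping the metric silently "breaks" the center-update step of standard K-means.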
Mini-batch K-means#
It is easy to see that, if we treat \(K\) and the dimension of the feature space as constants, both steps of the algorithm run in \(O(n)\), where \(n\) is the number of objects in the training sample. This suggests a way to speed up the algorithm. In mini-batch K-means we do not perform the steps on the entire sample at once; instead, at each iteration we select a random subsample (mini-batch) and work on it. When the original sample is very large, switching to batch processing loses little quality but significantly speeds up the algorithm.
K-means on MNIST#
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
%config InlineBackend.figure_format = 'svg'
X, y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X = X.astype(float).values / 255
y = y.astype(int).values
Apply mini-batch K-means:
from sklearn.cluster import MiniBatchKMeans
kmeans_mini = MiniBatchKMeans(n_clusters=10, n_init=10)
%time kmeans_mini.fit(X)
print("Inertia:", kmeans_mini.inertia_)
print("Class labels:", kmeans_mini.labels_)
Wall time: 1.28 s
Inertia: 2766243.9537299247
Class labels: [0 4 6 ... 8 9 7]
Calculate silhouette score:
from sklearn.metrics import silhouette_score
%time silhouette_score(X, kmeans_mini.labels_, metric='euclidean')
Wall time: 1min 36s
0.05651340628318254
Now plot the cluster centers:
plt.figure(figsize=(10, 10))
for i in range(10):
    plt.subplot(3, 4, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(kmeans_mini.cluster_centers_[i].reshape(28, 28), cmap='gray')
    plt.title(f"Label index: {i}", size=15)
Can you guess who is who here?
Now take the true K-means.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, n_init=10)
%time kmeans.fit(X)
print("Inertia:", kmeans.inertia_)
print("Class labels:", kmeans.labels_)
Wall time: 38.6 s
Inertia: 2744522.462060745
Class labels: [4 2 0 ... 5 4 7]
Silhouette score of K-means:
Important
See Clustering Metrics for information on Silhouette score
from sklearn.metrics import silhouette_score
%time silhouette_score(X, kmeans.labels_, metric='euclidean')
Wall time: 1min 36s
0.055995177407123176
Once again plot the centers of clusters:
plt.figure(figsize=(10, 10))
for i in range(10):
    plt.subplot(3, 4, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(kmeans.cluster_centers_[i].reshape(28, 28), cmap='gray')
    plt.title(f"Label index: {i}", size=15)
Resources:
Skoltech lecture on clusterization
Kaggle Tutorial: K-Means Clustering